Chapter 4 Corpus Analysis: A Start
In this chapter, I will demonstrate how to carry out a basic corpus analysis once you have collected your data, and I will show some of the most common ways of working with text data.
4.1 Installing quanteda
There are many packages that are made for computational text analytics in R. You may consult the CRAN Task View: Natural Language Processing for a lot more alternatives.
To start with, this tutorial will use a powerful package, quanteda, for managing and analyzing textual data in R. You may refer to the official documentation of the package for more detail.
quanteda is not included in the default R installation. Please install the package if you haven’t done so.
Also, as noted in the quanteda documentation, because this library compiles some C++ and Fortran source code, you will need to have the appropriate compilers installed.
- If you are using a Windows platform, this means you will also need to install the Rtools software available from CRAN.
- If you are using macOS, you should install the macOS tools.
If you run into any installation errors, please go to the official documentation page for additional assistance.
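A minimal sketch of the installation step, assuming a working compiler toolchain and access to a CRAN mirror (the mirror URL is my choice, not something prescribed by the package):

```r
# Install quanteda only if it is not already available
if (!requireNamespace("quanteda", quietly = TRUE)) {
  install.packages("quanteda", repos = "https://cloud.r-project.org")
}

library(quanteda)

# Check which version has been installed
packageVersion("quanteda")
```

The version number you see depends on when you install the package, so it may differ from the one printed in this chapter.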
## [1] '2.0.1'
4.2 Building a corpus from character vector
To demonstrate a typical corpus analysis, I will use a corpus that comes pre-loaded with the quanteda package, data_corpus_inaugural. It contains the texts of US presidential inaugural addresses from 1789 to the present, together with metadata for each address.
## Corpus consisting of 58 documents and 4 docvars.
## 1789-Washington :
## "Fellow-Citizens of the Senate and of the House of Representa..."
##
## 1793-Washington :
## "Fellow citizens, I am again called upon by the voice of my c..."
##
## 1797-Adams :
## "When it was first perceived, in early times, that no middle ..."
##
## 1801-Jefferson :
## "Friends and Fellow Citizens: Called upon to undertake the du..."
##
## 1805-Jefferson :
## "Proceeding, fellow citizens, to that qualification which the..."
##
## 1809-Madison :
## "Unwilling to depart from examples of the most revered author..."
##
## [ reached max_ndoc ... 52 more documents ]
## [1] "corpus" "character"
We create a corpus() object from the pre-loaded corpus in quanteda, data_corpus_inaugural:
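The step can be sketched as follows. The object name corp_us is the one used in the rest of this chapter. The exact calls are my reconstruction, and the number of documents depends on the installed quanteda version.

```r
library(quanteda)

# Create a corpus object from the pre-loaded inaugural corpus
corp_us <- corpus(data_corpus_inaugural)

# Print a preview of the corpus and check the class of the pre-loaded object
corp_us
class(data_corpus_inaugural)

# Number of documents and the names of the document-level variables
ndoc(corp_us)
names(docvars(corp_us))
```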
After the corpus is loaded, we can use summary() to get the metadata of each text in the corpus, including the number of word types and tokens. This gives us a quick look at the size of the addresses made by each president.
library(dplyr)   # provides the %>% pipe
library(ggplot2)

# Plot the number of tokens in each address by year
corp_us %>%
  summary() %>%
  ggplot(aes(x = Year, y = Tokens, group = 1)) +
  geom_line() +
  geom_point() +
  theme_bw()
Exercise 4.1 Could you reproduce the above line plot and add the name of each President to the plot as labels on the dots?
Hint: use ggplot2::geom_text() or, for a more advanced option, ggrepel::geom_text_repel().

4.3 Keyword-in-Context (KWIC)
Keyword-in-Context (KWIC) displays, or concordances, are the most frequently used method in corpus linguistics. The idea is intuitive: we learn more about the semantics of a word by examining how it is used in a wider context.
We can use kwic() to perform a search for a word and retrieve its concordances from the corpus:
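A minimal sketch of such a search is given here. It tokenizes the corpus explicitly with tokens() before calling kwic(), on the assumption that this works across quanteda versions. The object names toks_us and kwic_terror and the window size are mine.

```r
library(quanteda)

# Tokenize the corpus first (assumption: kwic() on a tokens object works in all recent versions)
toks_us <- tokens(data_corpus_inaugural)

# Retrieve all occurrences of "terror" with five tokens of context on each side
kwic_terror <- kwic(toks_us, pattern = "terror", window = 5)
head(kwic_terror)

# Coerce the result to a plain data frame, e.g. before writing it to a CSV file
kwic_terror_df <- as.data.frame(kwic_terror)
```

From here, write.csv(kwic_terror_df, ...) would save the concordance lines for later use.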
kwic() returns a data frame, which can be easily output to a CSV file for later use.
Please note that kwic(), when taking a corpus object as the argument, will automatically tokenize the corpus data and do the keyword-in-context search on a word basis. In other words, the pattern you look for cannot be a linguistic pattern across several words. We will talk about how to extract constructions later. Also, for languages without explicit word boundaries (e.g., Chinese), this may be a problem with quanteda. We will talk more about this in the later chapter on Chinese Texts Analytics.
4.4 KWIC with Regular Expressions
For more complex searches, we can also use regular expressions in kwic(). For example, if you want to include terror and all its related word forms, such as terrorist, terrorism, and terrors, you can do a regular expression search.
By default, kwic() is word-based. If you would like to look up a multiword combination, use phrase():
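Both kinds of search can be sketched as follows. The regular expression and the phrase "our country" are my own illustrative choices, and the corpus is tokenized explicitly as a precaution across quanteda versions.

```r
library(quanteda)

toks_us <- tokens(data_corpus_inaugural)

# Regular expression search: any token that starts with "terror"
kwic_regex <- kwic(toks_us, pattern = "^terror", valuetype = "regex")
unique(tolower(as.data.frame(kwic_regex)$keyword))

# Multiword search: wrap the pattern in phrase()
kwic_phrase <- kwic(toks_us, pattern = phrase("our country"))
head(kwic_phrase)
```

Without phrase(), a pattern containing a space would be matched against single tokens and would therefore find nothing.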
It should be noted that the output of kwic includes not only the concordances (i.e., preceding/subsequent co-texts + the keyword), but also the sources of the texts for each concordance line. This would be extremely convenient if you need to refer back to the original discourse context of the concordance line.
kwic() search.

4.5 Tidy Text Format of the Corpus
So far our corpus is a corpus object defined in quanteda. Most standard R packages follow tidy data principles to make handling data easier and more effective. As described by Hadley Wickham (Wickham and Grolemund 2017), tidy data has a specific structure:
- Each variable is a column
- Each observation is a row
- Each type of observational unit is a table
With text data like a corpus, we can also define the tidy text format as being a data.frame with one-token-per-row. A token is a meaningful unit of text, such as a word that we are interested in using for analysis, and tokenization is the process of splitting text into tokens.
In computational text analytics, the token (i.e., each row in the data frame) is most often a single word, but can also be an n-gram, sentence, or paragraph. The tidytext package in R is made for the handling of the tidy text format of the corpus data.
Tidy datasets allow manipulation with a standard set of tidy tools, including popular packages such as dplyr, tidyr, and ggplot2.
Figure 4.1: Computational Text Processing Flowchart
The tidytext package includes functions to tidy() objects from quanteda.
library(tidytext)
corp_us_tidy <- tidy(corp_us) # convert `corpus` to `data.frame`
class(corp_us_tidy)
## [1] "tbl_df" "tbl" "data.frame"
4.6 Frequency Lists
4.6.1 Word (Unigram)
Word tokenization is an important step in corpus analysis because words are a meaningful linguistic unit in language, and it is the first step toward a frequency list of words. Word frequency lists are also often indicative of important properties of the texts.
The tidytext package provides a powerful function, unnest_tokens(), to tokenize a data frame of larger linguistic units (e.g., texts) into one of smaller units (e.g., words). That is, unnest_tokens() converts a text-based data frame (each row is a text document) into a token-based data frame (each row is a token split from the text).
corp_us_words <- corp_us_tidy %>%
  unnest_tokens(output = word,
                input = text,
                token = "words") # tokenize the `text` column into `word`
corp_us_words
The unnest_tokens() function is optimized for tokenizing English texts into various linguistic units, such as words, ngrams, sentences, lines, and paragraphs (check ?unnest_tokens). To handle Chinese data, however, we need to define our own way of tokenization via unnest_tokens(…, token = …). We will discuss the principles for Chinese text processing in a later chapter.
Please note that by default, token = "words" normalizes the texts to lower case, and all non-word tokens (e.g., punctuation marks) are automatically removed. If you would like to preserve the case differences and the punctuation, you can include the following arguments: unnest_tokens(…, token = "words", to_lower = FALSE, strip_punct = FALSE, strip_numeric = FALSE).
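To see the effect of these defaults, here is a small sketch on a made-up one-sentence data frame (the toy text and the object names are mine). It assumes that to_lower is handled by unnest_tokens() itself and that strip_punct is passed on to the underlying word tokenizer.

```r
library(tidytext)
library(dplyr)

# A toy data frame with a single "document"
toy <- tibble(doc = 1, text = "Fellow citizens, I am again called upon.")

# Default behaviour: lower-cased words, punctuation removed
default_tokens <- toy %>%
  unnest_tokens(output = word, input = text, token = "words")

# Preserve the original casing and keep punctuation marks as tokens
kept_tokens <- toy %>%
  unnest_tokens(output = word, input = text, token = "words",
                to_lower = FALSE, strip_punct = FALSE)

default_tokens$word
kept_tokens$word
```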
Now we can count the word frequencies:
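A minimal sketch of the counting step, assuming dplyr::count() on the word column. The object name corp_us_words_freq is the one used later in this chapter for the word cloud and the collocations. The earlier objects are rebuilt here only so that the block runs on its own.

```r
library(quanteda)
library(tidytext)
library(dplyr)

# Rebuild the word-based data frame from the inaugural corpus
corp_us_words <- tidy(corpus(data_corpus_inaugural)) %>%
  unnest_tokens(output = word, input = text, token = "words")

# One row per word type, with its frequency in column `n`, most frequent first
corp_us_words_freq <- corp_us_words %>%
  count(word, sort = TRUE)

head(corp_us_words_freq, 10)
```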
4.6.2 Bigrams
Frequency lists can be generated for bigrams or any other multiword combinations as well:
corp_us_bigrams <- corp_us_tidy %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)
corp_us_bigrams
To create a bigram frequency list:
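The bigram frequency list can be sketched in the same way as the word frequency list. The object name corp_us_bigrams_freq and its columns bigram and n are taken from the collocation code later in this chapter. The use of count() is my assumption.

```r
library(quanteda)
library(tidytext)
library(dplyr)

# Rebuild the bigram-based data frame so that the block runs on its own
corp_us_bigrams <- tidy(corpus(data_corpus_inaugural)) %>%
  unnest_tokens(output = bigram, input = text, token = "ngrams", n = 2)

# One row per bigram type, with its frequency in column `n`, most frequent first
corp_us_bigrams_freq <- corp_us_bigrams %>%
  count(bigram, sort = TRUE)

head(corp_us_bigrams_freq, 10)
```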
## [1] 135562
## [1] 135504
unnest_tokens() does a lot of work behind the scenes. Please take a closer look at the output of unnest_tokens() and examine how it handles case normalization and the punctuation within sentences. Will these treatments affect the frequency lists we get in any important way? Please elaborate.
4.6.3 Ngrams (Lexical Bundles)
corp_us_trigrams <- corp_us_tidy %>%
  unnest_tokens(trigrams, text, token = "ngrams", n = 3)
corp_us_trigrams
We then can examine which n-grams were most often used by each President:
corp_us_trigrams %>%
  count(President, trigrams) %>%
  group_by(President) %>%
  top_n(3, n) %>%
  arrange(President, desc(n))
Exercise 4.4 Please subset the top 3 trigrams of Presidents Donald Trump, Bill Clinton, and John Adams from corp_us_trigrams.
4.6.4 Frequency and Dispersion
When looking at frequency lists, there is another distributional metric we need to consider: dispersion. An n-gram can be meaningful if its frequency is high. However, a high frequency can arise in different ways. What if the n-gram occurs in only ONE particular document, i.e., is used by only one particular President? Or alternatively, what if the n-gram appears in many different documents, i.e., is used by many different Presidents?
The degree of dispersion of an n-gram has a lot to do with the significance of its frequency.
So now let’s compute the dispersion of the n-grams in our corp_us_trigrams. Here we define the dispersion of an n-gram as the number of Presidents who have used the n-gram at least once in their address(es).
# method 1
corp_us_trigrams %>%
  count(trigrams, President) %>%
  group_by(trigrams) %>%
  summarize(FREQ = sum(n), DISPERSION = n()) %>%
  filter(DISPERSION >= 5) %>%
  arrange(desc(DISPERSION))
# method 2
corp_us_trigrams %>%
  group_by(trigrams) %>%
  summarize(FREQ = n(), DISPERSION = n_distinct(President)) %>%
  filter(DISPERSION >= 5) %>%
  arrange(desc(DISPERSION))
# Arrange according to frequency
# corp_us_trigrams %>%
#   count(trigrams, President) %>%
#   group_by(trigrams) %>%
#   summarize(freq = sum(n), dispersion = n()) %>%
#   arrange(desc(freq))
In particular, cut-off values are often set to select a list of meaningful n-grams. These cut-off values include the frequency of the n-grams as well as their dispersion. A subset of n-grams that is defined and selected based on these distributional criteria (i.e., frequency and dispersion) is often referred to as lexical bundles.
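To make the difference between frequency and dispersion concrete, here is a small sketch on made-up data. The toy trigrams, the speakers A to C, and the cut-off values are all hypothetical and chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data: one row per trigram token, with the speaker who used it
toy_trigrams <- tibble(
  President = c("A", "A", "A", "B", "C", "A", "A", "A", "A"),
  trigrams  = c(rep("of the people", 5), rep("my fellow citizens", 4))
)

# Frequency = number of tokens; dispersion = number of distinct speakers
toy_dispersion <- toy_trigrams %>%
  group_by(trigrams) %>%
  summarize(FREQ = n(), DISPERSION = n_distinct(President))

# Apply both cut-offs: only well-dispersed, frequent trigrams survive
toy_bundles <- toy_dispersion %>%
  filter(FREQ >= 3, DISPERSION >= 2)

toy_dispersion
toy_bundles
```

Both trigrams pass the frequency cut-off, but "my fellow citizens" is used by a single speaker and is therefore filtered out by the dispersion cut-off.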
4.7 Word Cloud
With frequency data, we can visualize important words in the corpus with a Word Cloud. It is a novel but intuitive visual representation of text data. It allows us to quickly perceive the most prominent words from a large collection of texts.
library(wordcloud)
set.seed(123)
with(corp_us_words_freq,
     wordcloud(word, n,
               max.words = 400,
               min.freq = 20,
               scale = c(2, 0.5),
               colors = brewer.pal(8, "Dark2"),
               vfont = c("serif", "plain")))
Exercise 4.6 A word cloud would be more informative if we first removed function words. In tidytext, there is a preloaded data frame, stop_words, which contains common English stop words. Please make use of this data frame and try to re-create a word cloud with all stop words removed. (Criteria: Frequency >= 20; Max Number of Words Plotted = 400)
Hint: dplyr::anti_join()

wordcloud2, and re-create a word cloud as requested in Exercise 4.6 but in a fancier format, i.e., a star-shaped one. (Criteria: Frequency >= 20; Max Number of Words Plotted = 400)
4.8 Collocations
With unigram and bigram frequencies of the corpus, we can further examine the collocations within the corpus. Collocation refers to the phenomenon where two words co-occur more often than would be expected by chance. This co-occurrence is defined statistically in terms of their lexical association.
4.8.1 Cooccurrence Table and Observed Frequencies
Cooccurrence frequency data for a word pair, w1 and w2, are often organized in a contingency table extracted from a corpus, as shown in Figure 4.2. The cell counts of this contingency table are called the observed frequencies O11, O12, O21, and O22.
Figure 4.2: Cooccurrence Frequency Table
The sum of all four observed frequencies (called the sample size N) is equal to the total number of bigrams extracted from the corpus. R1 and R2 are the row totals of the observed contingency table, while C1 and C2 are the corresponding column totals. The row and column totals are also called marginal frequencies, being written in the margins of the table, and O11 is called the joint frequency.
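The layout described above can be written out as follows. This is my own rendering of the standard 2-by-2 contingency table, with \(\neg w\) standing for "any word other than w". I assume it matches the table in Figure 4.2.

```latex
\begin{array}{l|cc|c}
                & w_2                     & \neg w_2                & \text{Total} \\ \hline
w_1             & O_{11}                  & O_{12}                  & R_1 = O_{11} + O_{12} \\
\neg w_1        & O_{21}                  & O_{22}                  & R_2 = O_{21} + O_{22} \\ \hline
\text{Total}    & C_1 = O_{11} + O_{21}   & C_2 = O_{12} + O_{22}   & N = R_1 + R_2 = C_1 + C_2
\end{array}
```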
4.8.2 Expected Frequencies
Equations for all association measures are given in terms of the observed frequencies, marginal frequencies, and the expected frequencies E11, …, E22. Expected frequencies refer to the expected number of co-occurrences under the null hypothesis that W1 and W2 are statistically independent. The expected frequencies can easily be computed from the marginal frequencies as shown in Figure 4.3.
Figure 4.3: Computing Expected Frequencies
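Written as a formula, and assuming the standard independence model, each expected frequency is the product of the corresponding row and column totals divided by the sample size:

```latex
E_{ij} = \frac{R_i \, C_j}{N}, \qquad i, j \in \{1, 2\},
\qquad \text{e.g.,} \quad E_{11} = \frac{R_1 \, C_1}{N}
```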
Maybe it would be easier for us to illustrate this with a simple example:
Figure 4.4: Computing Expected Frequencies
How do we compute the expected frequencies of the four cells?
Figure 4.5: Computing Expected Frequencies
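The computation can be sketched in a few lines of base R. The observed frequencies below are hypothetical numbers of my own, not the ones from the figures.

```r
# Hypothetical observed frequencies for a word pair (w1, w2)
O <- matrix(c(30,   70,
              170, 9730),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("w1", "not_w1"), c("w2", "not_w2")))

R <- rowSums(O)   # row totals R1, R2
C <- colSums(O)   # column totals C1, C2
N <- sum(O)       # sample size

# E_ij = R_i * C_j / N, computed for all four cells at once
E <- outer(R, C) / N
E
```

Here w1 and w2 co-occur 30 times, while only 100 * 200 / 10000 = 2 co-occurrences would be expected if the two words were independent.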
example in R.
4.8.3 Association Measures
The idea of lexical association is to measure how much the observed frequencies deviate from the expected frequencies. Some of the metrics (e.g., t-statistic, MI) consider only the deviation of the joint frequency (i.e., O11), while others (e.g., G2, a.k.a. Log-Likelihood Ratio) consider the deviations of ALL cells.
Here I would like to show you how we can compute the two most common association metrics for all the bigrams found in the corpus: the t-test statistic and Mutual Information (MI).
- \(t = \frac{O_{11}-E_{11}}{\sqrt{O_{11}}}\)
- \(MI = \log_2\frac{O_{11}}{E_{11}}\)
- \(G^2 = 2 \sum_{ij}{O_{ij}\log\frac{O_{ij}}{E_{ij}}}\)
N_bigrams <- sum(corp_us_bigrams_freq$n) # sample size N: all bigram tokens, counted before the cut-off
corp_us_collocations <- corp_us_bigrams_freq %>%
  filter(n > 5) %>% # set bigram frequency cut-off
  rename(O11 = n) %>%
  tidyr::separate(bigram, c("w1", "w2"), sep = "\\s") %>% # split bigrams into two columns
  mutate(R1 = corp_us_words_freq$n[match(w1, corp_us_words_freq$word)],
         C1 = corp_us_words_freq$n[match(w2, corp_us_words_freq$word)]) %>% # unigram freq of w1 and w2, used as marginal totals
  mutate(E11 = (R1 * C1) / N_bigrams) %>% # compute expected freq of bigrams
  mutate(MI = log2(O11 / E11),
         t = (O11 - E11) / sqrt(O11)) %>% # compute associations
  arrange(desc(MI)) # sorting
corp_us_collocations
Please note that in the above example, we compute the lexical associations only for bigrams whose frequency > 5. This is necessary in collocation studies because bigrams of very low frequency are not informative even though their association scores can be very high. However, the cut-off value can be arbitrary, depending on the corpus size or the researcher’s considerations.
How to compute lexical associations is a non-trivial issue. There are many more ways to compute the association strength between two words. Please refer to Stefan Evert’s site for a very comprehensive review of lexical association measures.
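To illustrate why a frequency cut-off matters for MI, here is a small arithmetic sketch. All counts are hypothetical and chosen only for illustration.

```r
# Hypothetical counts, chosen only for illustration
N <- 100000                        # total number of bigram tokens

# A frequent pair: 500 joint occurrences of two common words
E11_frequent <- (2000 * 1000) / N  # expected frequency = R1 * C1 / N
MI_frequent  <- log2(500 / E11_frequent)

# A rare pair: two words that each occur once, and occur together
E11_rare <- (1 * 1) / N
MI_rare  <- log2(1 / E11_rare)

c(frequent = MI_frequent, rare = MI_rare)
```

The rare pair receives a much higher MI score than the frequent pair, although a single co-occurrence tells us very little. This is the kind of case that the frequency cut-off is meant to exclude.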
corp_us_collocations according to the t-score and compare the results sorted by MI scores. Please describe what you find.
corp_us_collocations, which gives the Log-Likelihood Ratios of all the bigrams.
When you do the above exercise, you may run into a couple of problems:
- Some of the bigrams have NaN values in their LLR. This may be due to the issue of NAs produced by integer overflow. Please solve this.
- After solving the above overflow issue, you may still have a few bigrams with NaN in their LLR, which may be due to the computation of the log value. In Math, how do we define log(1/0) and log(0/1)? Do you know when you would get an undefined value NaN in the computation of log()?
- To solve the problems, please assign the value 0 if the log returns NaN values.
Exercise 4.11
Find the top FIVE bigrams ranked according to MI values for each president. The result would be a data frame as shown below.
- Create a plot as shown below to visualize your results.

4.9 Constructions
We are often interested in the use of linguistic patterns that go beyond word boundaries. In my experience, it is usually better to work with the corpus at the sentence level.
We can use the same tokenization function, unnest_tokens(), to convert our text-based corpus data frame, corp_us_tidy, into a sentence-based tidy structure:
corp_us_sents <- corp_us_tidy %>%
  unnest_tokens(output = sentence,
                input = text,
                token = "sentences") # tokenize the `text` column into `sentence`
corp_us_sents
With each sentence, we can investigate particular constructions in more detail. Let’s assume that we are interested in the use of the Perfect aspect in English by different presidents. We can try to extract Perfect constructions (including Present/Past Perfect) from each sentence using a regular expression.
Here we make a simple naive assumption: Perfect constructions include all have/has/had + VERB-en/ed tokens from the sentences.
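library(stringr)
# Perfect: have/has/had followed by a word ending in -en/-ed
# ha(d|ve|s) is a group of alternatives (had, have, has); \\b marks word boundaries
result_perfect <- corp_us_sents %>%
  unnest_tokens(perfect,
                sentence,
                token = function(x) str_extract_all(x, "\\bha(d|ve|s) \\w+(en|ed)\\b"))
result_perfect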
In the above example, we specify the token= argument in unnest_tokens(…, token = …) with a self-defined function. The idea of tokenization in unnest_tokens() is that the token argument can be a function which takes a text-based vector as input (i.e., each element of the input vector may be a document text) and returns a list, each element of which is a token-based version (i.e., vector) of the corresponding element of the input vector (see Figure 4.6).
In our demonstration, we define a tokenization function which takes sentence as the input and returns a list, each element of which consists of a vector of the tokens matching the regular expression in an individual sentence. (Note: The function is anonymous, i.e., it is not assigned to an object name and is therefore never created as an object in the R working session.)
Figure 4.6: Intuition for token= in unnest_tokens()
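The input and output of such a tokenization function can be inspected directly, outside unnest_tokens(). In this sketch the function is given a name (perfect_tokenizer, my choice) and applied to two made-up sentences. The pattern uses a grouped alternation ha(d|ve|s), since a character class such as [d|ve|s] would match only a single character.

```r
library(stringr)

# A named version of the self-defined tokenizer:
# input is a character vector, output is a list of character vectors
perfect_tokenizer <- function(x) {
  str_extract_all(x, "\\bha(d|ve|s) \\w+(en|ed)\\b")
}

# Two made-up sentences, already lower-cased
toy_sents <- c("we have seen much and they had hoped for more.",
               "the nation is strong.")

toy_tokens <- perfect_tokenizer(toy_sents)
toy_tokens
```

The result is a list with one element per input sentence: the first element holds the two matches found in the first sentence, and the second element is empty because the second sentence contains no match.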
And of course we can do an exploratory analysis of the frequencies of Perfect constructions by different presidents:
require(tidyr)
# table
result_perfect %>%
  group_by(President) %>%
  summarize(TOKEN_FREQ = n(),
            TYPE_FREQ = n_distinct(perfect))
# graph
result_perfect %>%
  group_by(President) %>%
  summarize(TOKEN_FREQ = n(),
            TYPE_FREQ = n_distinct(perfect)) %>%
  pivot_longer(c("TOKEN_FREQ", "TYPE_FREQ"), names_to = "STATISTIC", values_to = "NUMBER") %>%
  ggplot(aes(President, NUMBER, fill = STATISTIC)) +
  geom_bar(stat = "identity", position = position_dodge()) +
  theme(axis.text.x = element_text(angle = 90))
There are quite a few things we need to take care of more thoroughly:
- The auxiliary HAVE and the past participle do not necessarily have to stand next to each other in Perfect constructions.
- We have now lost track of one important piece of information: from which sentence of the Presidential addresses did we collect each Perfect construction token?
Any ideas how to solve all these issues?
Exercise 4.12 Please create a better regular expression to retrieve more tokens of English Perfect constructions, where the auxiliary and the participle may not be adjacent.
Exercise 4.13 Re-generate a result_perfect data frame, where you can keep track of:
- From which sentence of the address did the Perfect come? (e.g., SENT_ID column below)
- From which president’s address did the Perfect come? (e.g., INDEX column below)

References
Wickham, Hadley, and Garrett Grolemund. 2017. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 1st ed. O’Reilly Media, Inc.